{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "<center>\n",
    "<img src=\"https://habrastorage.org/files/fd4/502/43d/fd450243dd604b81b9713213a247aa20.jpg\">\n",
    "    \n",
    "## [mlcourse.ai](https://mlcourse.ai) - Open Machine Learning Course\n",
    "Auteur: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). Edité et traduit par [Ousmane Cissé](https://github.com/oussou-dev.com). Ce matériel est soumis aux termes et conditions de la licence [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). L'utilisation gratuite est autorisée à des fins non commerciales."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "## <center> Affectation 4. Détection des sarcasmes avec la régression logistique\n",
    "    \n",
    "Nous utiliserons l'ensemble de données de [paper](https://arxiv.org/abs/1704.05579) \"Un grand corpus auto-annoté pour le sarcasme\" avec 1 million de commentaires provenant du site Reddit, étiquetés comme sarcastiques ou non. \n",
    "\n",
    "Une version complète peut être trouvée sur Kaggle sous la forme d'un [Jeu de données Kaggle](https://www.kaggle.com/danofer/sarcasm).\n",
    "\n",
    "La détection des sarcasmes est facile.\n",
    "<img src=\"https://habrastorage.org/webt/1f/0d/ta/1f0dtavsd14ncf17gbsy1cvoga4.jpeg\" />"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "23a833b42b3c214b5191dfdc2482f2f901118247"
   },
   "outputs": [],
   "source": [
    "!ls ../input/sarcasm/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "ffa03aec57ab6150f9bec0fa56cd3a5791a3e6f4"
   },
   "outputs": [],
   "source": [
    "# some necessary imports\n",
    "import os\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "from matplotlib import pyplot as plt\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.metrics import accuracy_score, confusion_matrix\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.pipeline import Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "b23e4fc7a1973d60e0c6da8bd60f3d921542a856"
   },
   "outputs": [],
   "source": [
    "train_df = pd.read_csv(\"../input/sarcasm/train-balanced-sarcasm.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "4dc7b3787afa46c7eb0d0e33b0c41ab9821c4a27"
   },
   "outputs": [],
   "source": [
    "train_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "0a7ed9557943806c6813ad59c3d5ebdb403ffd78"
   },
   "outputs": [],
   "source": [
    "train_df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "Certains commentaires sont manquants, alors nous supprimons les lignes correspondantes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "97b2d85627fcde52a506dbdd55d4d6e4c87d3f08"
   },
   "outputs": [],
   "source": [
    "train_df.dropna(subset=[\"comment\"], inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "Nous remarquons que le jeu de données est bien équilibré"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "addd77c640423d30fd146c8d3a012d3c14481e11"
   },
   "outputs": [],
   "source": [
    "train_df[\"label\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "Nous divisons les données en données d'entraînement et données de validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "c200add4e1dcbaa75164bbcc73b9c12ecb863c96"
   },
   "outputs": [],
   "source": [
    "train_texts, valid_texts, y_train, y_valid = train_test_split(\n",
    "    train_df[\"comment\"], train_df[\"label\"], random_state=17\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "## Les tâches:\n",
    "1. Analysez le jeu de données, faites des graphiques. Ce [kernel](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-qiqc) pourrait servir d'exemple.\n",
    "2. Créez un Tf-Idf + un pipeline de régression logistique pour prédire le sarcasme (`label`) en fonction du texte d'un commentaire sur Reddit (` comment`).\n",
    "3. Tracer les mots/bigrammes qui sont les plus prédictifs du sarcasme (vous pouvez utiliser [eli5](https://github.com/TeamHG-Memex/eli5) pour cela)\n",
    "4. (facultatif) ajouter des sous-titres (subreddits) en tant que nouvelles caractéristiques pour améliorer les performances du modèle. Appliquez ici l’approche Sac de mots (Bag of Words), c’est-à-dire traitez chaque sous-titre comme une nouvelle caractéristique.\n",
    "\n",
    "## Liens:\n",
    "  - Bibliothèque d'apprentissage automatique [Scikit-learn](https://scikit-learn.org/stable/index.html) (a.k.a. sklearn)\n",
    "  - Kernels sur la [régression logistique](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-2-classification) et ses applications en [classification de texte](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-4-more-of-logit), également un [Kernel](https://www.kaggle.com/kashnitsky/topic-6-feature-engineering- and-feature-selection) sur l'ingénierie et la sélection des caractéristiques (feature engineering and feature selection)\n",
    "  - [Kaggle Kernel](https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle) \"Approcher (presque) n'importe quel problème de NLP sur Kaggle\"\n",
    "  - [ELI5](https://github.com/TeamHG-Memex/eli5) pour expliquer les prédictions du modèle"
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  },
  "nbTranslate": {
   "displayLangs": [
    "*"
   ],
   "hotkey": "alt-t",
   "langInMainMenu": true,
   "sourceLang": "en",
   "targetLang": "fr",
   "useGoogleTranslate": true
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
